{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Configuring analyzers for the MSMARCO Document dataset\n",
    "\n",
    "Before we start tuning queries and other index parameters, we wanted to first show a very simple iteration on the standard analyzers. In the MS MARCO Document dataset we have three fields: `url`, `title` and `body`. We tried just couple very small improvements, mostly to stopword lists, to see what would happen to our baseline queries. We now have two indices to play with:\n",
    "\n",
    "- `msmarco-doument.defaults` with some default analyzers\n",
    " - `url`: standard\n",
    " - `title`: english\n",
    " - `body`: english\n",
    "- `msmarco-document` with customized analyzers\n",
    " - `url`: english with URL-specific stopword list\n",
    " - `title`: english with question-specfic stopword list\n",
    " - `body`: english with question-specfic stopword list\n",
    "\n",
    "The stopword lists have been changed:\n",
    " 1. Since the MS MARCO query dataset is all questions, it makes sense to add a few extra stop words like: who, what, when where, why, how\n",
    " 1. URLs in addition have some other words that don't really need to be searched on: http, https, www, com, edu\n",
    " \n",
    "More details can be found in the index settings in `conf`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import importlib\n",
    "import os\n",
    "import sys\n",
    "\n",
    "from elasticsearch import Elasticsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# project library\n",
    "sys.path.insert(0, os.path.abspath('..'))\n",
    "\n",
    "import qopt\n",
    "importlib.reload(qopt)\n",
    "\n",
    "from qopt.notebooks import evaluate_mrr100_dev"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use a local Elasticsearch or Cloud instance (https://cloud.elastic.co/)\n",
    "es = Elasticsearch('http://localhost:9200')\n",
    "\n",
    "# set the parallelization parameter `max_concurrent_searches` for the Rank Evaluation API calls\n",
    "max_concurrent_searches = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparisons\n",
    "\n",
    "The following runs a series of comparisons between the baseline default index `msmarco-document.default` and the custom index `msmarco-document`. We use multiple query types just to confirm that we make improvements across all of them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Query: combined per-field `match`es"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def combined_matches(index):\n",
    "    evaluate_mrr100_dev(es, max_concurrent_searches, index, 'combined_matches', params={})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2403\n",
      "CPU times: user 2.26 s, sys: 615 ms, total: 2.87 s\n",
      "Wall time: 4min 44s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "combined_matches('msmarco-document.defaults')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2504\n",
      "CPU times: user 2.12 s, sys: 639 ms, total: 2.76 s\n",
      "Wall time: 3min 34s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "combined_matches('msmarco-document')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Query: `multi_match` `cross_fields`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def multi_match_cross_fields(index):\n",
    "    evaluate_mrr100_dev(es, max_concurrent_searches, index,\n",
    "        template_id='cross_fields',\n",
    "        params={\n",
    "            'operator': 'OR',\n",
    "            'minimum_should_match': 50,  # in percent/%\n",
    "            'tie_breaker': 0.0,\n",
    "            'url|boost': 1.0,\n",
    "            'title|boost': 1.0,\n",
    "            'body|boost': 1.0,\n",
    "        })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2475\n",
      "CPU times: user 2.29 s, sys: 732 ms, total: 3.02 s\n",
      "Wall time: 4min 10s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "multi_match_cross_fields('msmarco-document.defaults')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2683\n",
      "CPU times: user 2.13 s, sys: 709 ms, total: 2.84 s\n",
      "Wall time: 4min 26s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "multi_match_cross_fields('msmarco-document')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Query: `multi_match` `best_fields`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def multi_match_best_fields(index):\n",
    "    evaluate_mrr100_dev(es, max_concurrent_searches, index,\n",
    "        template_id='best_fields',\n",
    "        params={\n",
    "            'tie_breaker': 0.0,\n",
    "            'url|boost': 1.0,\n",
    "            'title|boost': 1.0,\n",
    "            'body|boost': 1.0,\n",
    "        })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2714\n",
      "CPU times: user 2.16 s, sys: 731 ms, total: 2.89 s\n",
      "Wall time: 4min 9s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "multi_match_best_fields('msmarco-document.defaults')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation with: MRR@100\n",
      "Score: 0.2873\n",
      "CPU times: user 2.14 s, sys: 641 ms, total: 2.78 s\n",
      "Wall time: 4min 27s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "multi_match_best_fields('msmarco-document')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "As you can see, there's a measurable and consistent improvement with just some minor changes to the default analyzers. All other notebooks that follow will use the custom analyzers including for their baseline measurements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
